Part 1: Selecting an Expression Dataset

Using GEOmetadb to query datasets of interest and selecting relavent ones (dont evaluate on knit)

if(!file.exists('GEOmetadb.sqlite')) 
  GEOmetadb::getSQLiteFile()
con <- DBI::dbConnect(RSQLite::SQLite(),'GEOmetadb.sqlite', 
                      synchronous = NULL)
Geo_tables <- DBI::dbListTables(con)
Geo_tables

Query datasets with high throughput seqeuncing and RNA-Seq in title. I also browsed through the GEO interactive search to find the dataset.

sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
" gse.submission_date,",
" gse.supplementary_file",
"FROM",
" gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gse.title LIKE '%RNA-Seq%' AND",
" gpl.organism LIKE '%Homo sapiens%' AND",
" gpl.technology LIKE '%High-throughput sequencing%' ",sep=" ")
result <- DBI::dbGetQuery(con,sql)

Part 2: Data cleaning (Chosen Dataset - GSE148372)

Now that the dataset of interest is chosen, retrieve the metadata and supplementary files of interest.

chosen_gse <- invisible(GEOquery::getGEO("GSE148372", GSEMatrix = FALSE))
chosen_gse_info <- data.frame(head(GEOquery::Meta(chosen_gse)))

chosen_geo <- names(GEOquery::GPLList(chosen_gse))[1]
chosen_geo <- GEOquery::Meta(GEOquery::getGEO(chosen_geo))
chosen_geo$title
## [1] "Illumina NovaSeq 6000 (Homo sapiens)"
chosen_geo$submission_date
## [1] "Mar 02 2018"
chosen_geo$last_update_date
## [1] "Nov 05 2018"
chosen_geo$organism
## [1] "Homo sapiens"
length(chosen_geo$series_id)
## [1] 7818
length(chosen_geo$sample_id)
## [1] 169121

Platform information on chosen dataset:

Title : Illumina NovaSeq 6000 (Homo sapiens)
Submission data : Mar 02 2018
Last update data : Nov 05 2018   Organism : Homo sapiens
Number of GEO datasets that use this technology : 7818
Number of GEO samples that use this technology : 169121 Contact Location : USA

The dataset only has one supplementary file with counts, this code simply looks into some initial info and stores the csv dataframe in memory.

if (!exists('sfiles'))
  sfiles <- GEOquery::getGEOSuppFiles('GSE148372', filter_regex = "csv")
fnames <- rownames(sfiles)
fnames
## [1] "/home/rstudio/GSE148372/GSE148372_ko.count.csv.gz"
RNAfile <- read.csv(fnames[1], check.names = FALSE)
colnames(RNAfile)[1] <- "gene" 
RNAfile
colnames(RNAfile)
##  [1] "gene"                 "HCC1806.wildtype.g2"  "HCC1806.wildtype.g1" 
##  [4] "HCC1806.PMVKko.g1"    "HCC1806.PMVKko.g2"    "HCC1806.SCAPko.g1"   
##  [7] "HCC1806.SCAPko.g2"    "HCC1806.SREBF1ko.g1"  "HCC1806.SREBF1ko.g2" 
## [10] "HCC1954.wildtype.g2"  "HCC1954.wildtype.g1"  "HCC1954.SREBF1ko.g1" 
## [13] "HCC1954.SREBF1ko.g2"  "JIMT1.wildtype.g2"    "JIMT1.wildtype.g1"   
## [16] "JIMT1.PMVKko.g1"      "JIMT1.PMVKko.g2"      "JIMT1.SCAPko.g1"     
## [19] "JIMT1.SCAPko.g2"      "JIMT1.SREBF1ko.g1"    "JIMT1.SREBF1ko.g2"   
## [22] "MDAMB231.wildtype.g2" "MDAMB231.wildtype.g1" "MDAMB231.SREBF1ko.g2"
dim(RNAfile)
## [1] 23686    24

Part 3: Mapping to HUGO symbols

This section will look into generating the HGNC symbol mapping for each gene in the data.
Since the dataset does not have ensemble ids attached to it, the gene identifier is used to try and match the symbols.

if(!exists('ensembl')){
  ensembl <- biomaRt::useMart(biomart = "ensembl", dataset="hsapiens_gene_ensembl")
}
  geneID_map <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
                                 filters = 'ensembl_gene_id',
                                 values = RNAfile$gene,
                                 mart = ensembl)

dim(geneID_map)
## [1] 0 2

As seen above, the geneId map does not result in any symbols to match with. Since there is not ensemble id
attached to the dataset, the gene identifiers will be used in this report instead of HGNC symbols.

Part 4: Normalization

grouping <-lapply(colnames(RNAfile)[2:length(colnames(RNAfile))], function (x) {unlist(strsplit(x, split='\\.'))[1]})
RNA_matrix <- data.frame(RNAfile, check.names = F)
colnames(RNA_matrix) <- c('gene', grouping)
df <- RNA_matrix[, -1]
colnames(df) <- grouping
df <- t(rowsum(t(df), group = colnames(df), na.rm = T ))
colnames(df)
## [1] "HCC1806"  "HCC1954"  "JIMT1"    "MDAMB231"
grouping <- data.frame(unique(grouping))
colnames(df)
## [1] "HCC1806"  "HCC1954"  "JIMT1"    "MDAMB231"
# create DEGList and calculate normalization factors:
dge = edgeR::DGEList(counts=data.matrix(df), group=grouping)
dge = edgeR::calcNormFactors(dge, method = "TMM")
normalized_counts <- edgeR::cpm(dge)
normalized_counts <- data.frame(cbind(data.frame(normalized_counts), gene=RNAfile$gene))
normalized_counts

Steps taken to normalize the counts data -

  1. Define groupings for dataset:
  • The sample has been grouped by 4 different types of cell lines: “HCC1806” “HCC1954” “JIMT1” “MDAMB231” as the experiment focuses on these 4 cell lines
  1. Next, the DEGList object is created and we use the TMM method to normalize the counts:
  • The reason for using TMM is to account for the size variation between samples of interest in the library and accounting for the fact that some genes that are extremely differently expressed will skew results.
  1. The normalized counts are then attached along with the gene identifier column to effectively use the data later in analyses.

Part 5: Visualization

MDS plot to see relations between counts of different cell lines in the experiment

## Loading required package: limma

It is evident that the HCC1954 and MDAMB231 cell lines are more represented in this dataset while JIMT1 and HCC1806 are more closely represented.

Gene representation for each cell line

par(mfrow=c(2,2))
barplot(normalized_counts$HCC1806, names.arg = normalized_counts$gene, main="HCC1806")
barplot(normalized_counts$HCC1954, names.arg = normalized_counts$gene, main="HCC1954")
barplot(normalized_counts$JIMT1, names.arg = normalized_counts$gene, main="JIMT1")
barplot(normalized_counts$MDAMB231, names.arg = normalized_counts$gene, main="MDAMB231")

Part 6: Questions

What are the control and test conditions of the dataset?

MDAMB231, HCC1806, HCC1954, and JIMT1 cell lines are controls in the dataset. They are compared with infected CRISPR/Cas9-expressing versions of themselves in the dataset.

Why is the dataset of interest to you?

The study mentions that a large subset of deaths from cancers are attributed to metastasis. Since large scale research in this field is impractivcal the use of human cells in mouse xenografts can allow to scale this.

Were there expression values that were not unique for specific genes? How did you handle these?

No, the counts dataset had already one row per gene for every cell line in the experiment.

Were there expression values that could not be mapped to current HUGO symbols?

Yes, the dataset did not come with ensemble ids and therefore had to use the gene symbols already present in the dataset for the analysis.

How many outliers were removed?

The dataset is not too large and did not have much variation for obvious outliers and therefore none have been removed.

How did you handle replicates?

Replicate columns were merged in by summing up count values.

What is the final coverage of your dataset?

The final dataset includes 23,686 rows of normalized counts for each of the cell lines.

References